@*
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*@
<p>
    This heuristic gauges your mapper performance in a disk I/0 perspective. Mapper spill ratio (spilled records/output
    records) is a critical indicator to your mapper performance: if the ratio is close to 2, it means each record is
    spilled to disk twice(once when in-memory sort buffer is almost full, once when merging spilled splits). This
    usually happens when your mappers have large amount of outputs. If the ratio is higher than 2, we suggest you try
    our recommendation below. Having large disk I/O wasted on sorting output records could seriously affect your mapper
    speed. To make it run faster, try our tentative recommendation. We newly enabled this heuristic and want you to test
    it! <br>
</p>
<p>

</p>
<h5>Example</h5>
<p>
<div class="list-group">
    <a class="list-group-item list-group-item-danger" href="">
        <h4 class="list-group-item-heading">Mapper Spill</h4>
        <table class="list-group-item-text table table-condensed left-table">
            <thead><tr><th colspan="2">Severity: Critical</th></tr></thead>
            <tbody>
            <tr>
                <td>Number of spilled records</td>
                <td>20000</td>
            </tr>
            <tr>
                <td>Number of Mapper output records</td>
                <td>8000</td>
            </tr>
            <tr>
                <td>Ratio of disk spills to output records</td>
                <td>2.5</td>
            </tr>
            </tbody>
        </table>
    </a>
</div>
</p>
<h3>Suggestions</h3>
<p>
This heuristic is less straightforward than others, and it requires deeper hadoop knowledge. We are still working on finalizing the recommendation. Feedbacks welcomed! You could try:
<ol>
<li> Increase the size of in-memory sort buffer (mapreduce.task.io.sort.mb), default 100M</li>
<li> Increase the buffer spill percentage (mapreduce.map.sort.spill.percent, when it is reached a background thread will start spill buffer to disk), default value is 0.8.</li>
<li> Use combiner to lower the map output size. </li>
<li> Compress mapper output (set mapreduce.map.output.compress and mapreduce.map.output.compress.codec)</li>
</ol>
</p>